This document contains time series analysis for the DSAN5100 final project.
Load data files and preprocess
# Load required libraries for R
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(arrow)
Attaching package: 'arrow'
The following object is masked from 'package:lubridate':
duration
The following object is masked from 'package:utils':
timestamp
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:arrow':
schema
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
library(zoo)
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
library(forecast)
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
library(imputeTS)
Attaching package: 'imputeTS'
The following object is masked from 'package:zoo':
na.locf
library(tseries)
Attaching package: 'tseries'
The following object is masked from 'package:imputeTS':
na.remove
library(patchwork)
library(astsa)
Attaching package: 'astsa'
The following object is masked from 'package:forecast':
gas
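Because several of the loaded packages mask each other's functions (e.g. dplyr::filter masks stats::filter), the `::` operator pins a call to a specific namespace. A minimal base-R sketch of why this matters:

```r
# stats::filter applies a linear (moving-average) filter to a series,
# which is unrelated to dplyr::filter (row subsetting).
# Using the :: operator makes the intended function explicit.
x <- 1:5

# One-sided moving sum of the current and previous value
ma <- stats::filter(x, filter = c(1, 1), sides = 1)
print(ma)  # NA 3 5 7 9
```

After `library(tidyverse)`, a bare `filter()` resolves to dplyr's version, so time-series code that relies on `stats::filter` should always qualify the call.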
# Set Arrow option to skip null characters
options(arrow.skip_nul = TRUE)

# Set seed for reproducibility
set.seed(42)
# A tibble: 6 × 16
ID publishedAt `source-name` location_code location category year month
<int> <chr> <chr> <chr> <chr> <chr> <int> <int>
1 816 2020-08-07T11… Minneapolis … us United … general 2020 8
2 901 2020-08-07T11… Google News us United … general 2020 8
3 961 2020-08-07T11… CNBC us United … general 2020 8
4 1054 2020-08-07T12… Google News us United … general 2020 8
5 1147 2020-08-07T12… CNET us United … general 2020 8
6 1335 2020-08-07T13… The Washingt… us United … general 2020 8
# ℹ 8 more variables: new_title <chr>, neg <dbl>, neu <dbl>, pos <dbl>,
# compound <dbl>, sentiment_category <chr>, bias_category <chr>,
# bias_score <dbl>
# Change publishedAt column into a date type
# Function to convert publishedAt to date type
convert_date_column <- function(df) {
  df <- df %>%
    mutate(publishedAt = as.Date(publishedAt))
  return(df)
}

# Apply date conversion to all dataframes
argentina_df <- convert_date_column(argentina_df)
canada_df <- convert_date_column(canada_df)
china_df <- convert_date_column(china_df)
india_df <- convert_date_column(india_df)
italy_df <- convert_date_column(italy_df)
russia_df <- convert_date_column(russia_df)
us_df <- convert_date_column(us_df)

# Check the data type conversion
cat("Date type conversion completed:\n")
Date type conversion completed:
cat("publishedAt column type for US data:", class(us_df$publishedAt), "\n")
publishedAt column type for US data: Date
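`as.Date()` parses only the leading `%Y-%m-%d` portion of the ISO 8601 strings in `publishedAt` (e.g. "2020-08-07T11…" in the tibble above), silently dropping the time component. A quick base-R check of that behavior:

```r
# as.Date() matches the "%Y-%m-%d" prefix and ignores the trailing time part
pub_date <- as.Date("2020-08-07T11:30:00")
print(pub_date)         # "2020-08-07"
print(class(pub_date))  # "Date"
```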
# Check date range for various country data
cat("Date range for US data:", as.character(range(us_df$publishedAt, na.rm = TRUE)), "\n")
Date range for US data: 2015-04-01 2021-11-29
cat("Date range for Argentina data:", as.character(range(argentina_df$publishedAt, na.rm = TRUE)), "\n")
Date range for Argentina data: 2016-10-02 2021-11-29
cat("Date range for Canada data:", as.character(range(canada_df$publishedAt, na.rm = TRUE)), "\n")
Date range for Canada data: 2013-12-26 2021-11-29
cat("Date range for China data:", as.character(range(china_df$publishedAt, na.rm = TRUE)), "\n")
Date range for China data: 2019-11-23 2021-11-29
cat("Date range for India data:", as.character(range(india_df$publishedAt, na.rm = TRUE)), "\n")
Date range for India data: 2015-04-25 2021-11-29
cat("Date range for Italy data:", as.character(range(italy_df$publishedAt, na.rm = TRUE)), "\n")
Date range for Italy data: 2016-07-10 2021-11-29
cat("Date range for Russia data:", as.character(range(russia_df$publishedAt, na.rm = TRUE)), "\n")
Date range for Russia data: 2010-11-24 2021-11-29
All of the countries have an end date of 2021-11-29, after which the data was no longer collected. The earliest published-article dates range from 2010-11-24 for Russia to 2019-11-23 for China.
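Since the start dates differ by country, the overlapping analysis window runs from the latest start date to the shared end date. A small base-R sketch, with the start dates hard-coded from the ranges printed above:

```r
# Start dates per country, copied from the ranges printed above
starts <- as.Date(c(us = "2015-04-01", argentina = "2016-10-02",
                    canada = "2013-12-26", china = "2019-11-23",
                    india = "2015-04-25", italy = "2016-07-10",
                    russia = "2010-11-24"))
common_end <- as.Date("2021-11-29")  # shared end date across all countries

# The overlapping window starts at the latest start date (China)
common_start <- max(starts)
print(common_start)                           # "2019-11-23"
print(as.numeric(common_end - common_start))  # days in the common window
```

This is one reason the analysis below restricts attention to 2020 onward: it is the period covered by every country's data.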
# Combine the country datasets into df_countries
# Set Arrow option to handle null characters
options(arrow.skip_nul = TRUE)
df_countries <- bind_rows(
  us_df, canada_df, china_df, india_df, italy_df, russia_df, argentina_df
)
Warning in vec_rbind(!!!dots, .names_to = .id, .error_call = current_env()):
Stripping '\0' (nul) from character vector
Countries included: United States Canada China India Italy Russian Federation Argentina
# Filter the dataset to start from 2020 for more focused analysis
df_countries_2020 <- df_countries %>%
  filter(year > 2019)
Analysis of Health, General, and Science categories combined
To capture the sentiment around COVID-19 and to leverage the scarce related data in the dataset, we combined the Health, Science, and General categories to explore the sentiment and bias during the COVID-19 pandemic (2020-02 to 2021-11).
# Combined time series plot for sentiment and bias across health, general, and science categories, daily
# Prepare daily combined data for USA with both sentiment and bias for health, general, and science categories
usa_combined_daily <- df_countries %>%
  filter(location == "United States",
         category %in% c("health", "general", "science", "business")) %>%
  group_by(publishedAt) %>%
  summarise(
    avg_sentiment = mean(compound, na.rm = TRUE),
    avg_bias = mean(bias_score, na.rm = TRUE),
    n_articles = n(),
    .groups = 'drop'
  ) %>%
  rename(date = publishedAt) %>%
  # Transform to long format for plotting
  pivot_longer(
    cols = c(avg_sentiment, avg_bias),
    names_to = "metric",
    values_to = "score"
  ) %>%
  mutate(
    metric = case_when(
      metric == "avg_sentiment" ~ "Sentiment",
      metric == "avg_bias" ~ "Bias"
    )
  )
head(usa_combined_daily)
cat("Monthly data range:", as.character(range(usa_combined_monthly$year_month)), "\n")
Monthly data range: 2015-04-01 2021-11-01
cat("Number of months:", nrow(usa_combined_monthly), "\n")
Number of months: 32
Fill in missing values
# Fill in missing data using Kalman filter
# Fill in missing data for daily df
# Create a complete date sequence to identify missing days
date_range <- seq(from = min(usa_combined_daily_transformed$date),
                  to = max(usa_combined_daily_transformed$date),
                  by = "day")

# Create complete daily dataframe with all dates
usa_daily_complete <- data.frame(date = date_range) %>%
  left_join(usa_combined_daily_transformed, by = "date")

# Check for missing values
cat("Missing values in daily data:\n")
# Fill in missing data for monthly df
# Create a complete monthly sequence
monthly_range <- seq(from = floor_date(min(usa_combined_monthly$year_month), "month"),
                     to = floor_date(max(usa_combined_monthly$year_month), "month"),
                     by = "month")

# Create complete monthly dataframe with all months
usa_monthly_complete <- data.frame(year_month = monthly_range) %>%
  left_join(usa_combined_monthly, by = "year_month")

# Check for missing values
cat("\nMissing values in monthly data:\n")
# Apply Kalman filter to fill missing values (if any)
# Check if we actually have missing values before applying Kalman filter
if (any(is.na(usa_monthly_complete$sentiment)) || any(is.na(usa_monthly_complete$bias))) {
  cat("Applying Kalman filter to fill missing monthly values...\n")
  usa_monthly_filled <- usa_monthly_complete %>%
    mutate(
      sentiment_filled = na_kalman(sentiment, model = "StructTS", smooth = TRUE),
      bias_filled = na_kalman(bias, model = "StructTS", smooth = TRUE)
    ) %>%
    select(year_month, sentiment = sentiment_filled, bias = bias_filled)
} else {
  cat("No missing values found in monthly data - using original data...\n")
  usa_monthly_filled <- usa_monthly_complete %>%
    select(year_month, sentiment, bias)
}
Applying Kalman filter to fill missing monthly values...
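`na_kalman()` from imputeTS fills gaps by smoothing over a fitted structural state-space model. To build intuition for what gap-filling does, a much simpler base-R analog uses linear interpolation via `approx()` (this is an illustrative toy, not the imputation actually used above):

```r
# Toy monthly series with two missing values
x <- c(0.2, NA, 0.6, 0.5, NA, 0.9)
idx <- seq_along(x)

# Linear interpolation across the observed points
filled <- approx(x = idx[!is.na(x)], y = x[!is.na(x)], xout = idx)$y
print(filled)  # 0.2 0.4 0.6 0.5 0.7 0.9
```

The Kalman-smoother approach generalizes this: instead of drawing straight lines between neighbors, it uses the estimated level/trend dynamics of the whole series to infer the missing points.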
# Convert daily df into a time series
# Extract the start date for daily time series
daily_start_date <- as.Date(min(usa_daily_filled$date))
daily_start_year <- as.numeric(format(daily_start_date, "%Y"))
daily_start_yday <- as.numeric(format(daily_start_date, "%j"))

# Create daily time series objects
sentiment_daily_ts <- ts(usa_daily_filled$sentiment,
                         start = c(daily_start_year, daily_start_yday),
                         frequency = 365.25)
bias_daily_ts <- ts(usa_daily_filled$bias,
                    start = c(daily_start_year, daily_start_yday),
                    frequency = 365.25)

# Display basic info about the time series
cat("Daily Time Series Information:\n")
# Convert monthly df into a time series
# Extract the start date for monthly time series
monthly_start_date <- as.Date(min(usa_monthly_filled$year_month))
monthly_start_year <- as.numeric(format(monthly_start_date, "%Y"))
monthly_start_month <- as.numeric(format(monthly_start_date, "%m"))

# Create monthly time series objects
sentiment_monthly_ts <- ts(usa_monthly_filled$sentiment,
                           start = c(monthly_start_year, monthly_start_month),
                           frequency = 12)
bias_monthly_ts <- ts(usa_monthly_filled$bias,
                      start = c(monthly_start_year, monthly_start_month),
                      frequency = 12)

# Display basic info about the time series
cat("Monthly Time Series Information:\n")
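The `start = c(year, period)` and `frequency` arguments place each observation on the calendar. A minimal base-R sketch of the monthly convention used above:

```r
# Monthly series starting April 2015, matching the monthly data above
m <- ts(1:5, start = c(2015, 4), frequency = 12)

print(time(m))       # 2015.250 2015.333 ... i.e. year + (month - 1)/12
print(cycle(m))      # 4 5 6 7 8 -- the month index of each observation
print(frequency(m))  # 12
```

For the daily series, `frequency = 365.25` is an approximation that absorbs leap years; the second element of `start` is then the day-of-year rather than the month.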
# Difference the data
diff_sentiment_daily <- diff(sentiment_daily_ts)

# Plot the differenced data
p1 <- ggplot() +
  geom_line(aes(x = time(diff_sentiment_daily),
                y = as.numeric(diff_sentiment_daily)),
            color = "steelblue") +
  labs(title = "Differenced Daily Sentiment - USA Combined Categories",
       x = "Time",
       y = "Differenced Sentiment Score") +
  theme_minimal()

# Display the Differenced Plot
p1
Don't know how to automatically pick scale for object of type <ts>. Defaulting
to continuous.
# Check stationarity of differenced series
cat("ADF Test for Differenced Daily Sentiment:\n")
ADF Test for Differenced Daily Sentiment:
adf_diff_daily_sentiment <- adf.test(diff_sentiment_daily, alternative = "stationary")
Warning in adf.test(diff_sentiment_daily, alternative = "stationary"): p-value
smaller than printed p-value
print(adf_diff_daily_sentiment)
Augmented Dickey-Fuller Test
data: diff_sentiment_daily
Dickey-Fuller = -10.18, Lag order = 13, p-value = 0.01
alternative hypothesis: stationary
cat("Interpretation: ")
Interpretation:
if (adf_diff_daily_sentiment$p.value < 0.05) {
  cat("Differenced series is STATIONARY (p-value < 0.05)\n")
} else {
  cat("Differenced series is still NON-STATIONARY (p-value >= 0.05)\n")
}
Differenced series is STATIONARY (p-value < 0.05)
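The logic behind differencing can be checked on a simulated random walk, which has a unit root (non-stationary in levels) but is stationary after one `diff()`. A base-R sketch on simulated data, not the project series:

```r
set.seed(42)
# Random walk: cumulative sum of white noise (unit root, non-stationary)
rw <- cumsum(rnorm(500))

# First difference recovers the white-noise increments
d1 <- diff(rw)

# Lag-1 sample autocorrelation: near 1 for the walk, near 0 for its difference
acf_rw <- acf(rw, plot = FALSE)$acf[2]
acf_d1 <- acf(d1, plot = FALSE)$acf[2]
print(round(acf_rw, 2))
print(round(acf_d1, 2))
```

The ADF test formalizes this: it tests the null of a unit root, so a small p-value on the differenced series is evidence that one round of differencing removed the non-stationarity.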
# No need to difference the daily bias data as the series is stationary
# Difference the data
diff_sentiment_monthly <- diff(sentiment_monthly_ts)

# Plot the differenced data
p3 <- ggplot() +
  geom_line(aes(x = time(diff_sentiment_monthly),
                y = as.numeric(diff_sentiment_monthly)),
            color = "darkblue") +
  labs(title = "Differenced Monthly Sentiment - USA Combined Categories",
       x = "Time",
       y = "Differenced Sentiment Score") +
  theme_minimal()

# Display the Differenced Plot
p3
Don't know how to automatically pick scale for object of type <ts>. Defaulting
to continuous.
# Check stationarity of differenced series
cat("ADF Test for Differenced Monthly Sentiment:\n")
ADF Test for Differenced Monthly Sentiment:
adf_diff_monthly_sentiment <- adf.test(diff_sentiment_monthly, alternative = "stationary")
Warning in adf.test(diff_sentiment_monthly, alternative = "stationary"):
p-value smaller than printed p-value
print(adf_diff_monthly_sentiment)
Augmented Dickey-Fuller Test
data: diff_sentiment_monthly
Dickey-Fuller = -4.6235, Lag order = 4, p-value = 0.01
alternative hypothesis: stationary
cat("Interpretation: ")
Interpretation:
if (adf_diff_monthly_sentiment$p.value < 0.05) {
  cat("Differenced series is STATIONARY (p-value < 0.05)\n")
} else {
  cat("Differenced series is still NON-STATIONARY (p-value >= 0.05)\n")
}
Differenced series is STATIONARY (p-value < 0.05)
# Difference the data
diff_bias_monthly <- diff(bias_monthly_ts)

# Plot the differenced data
p4 <- ggplot() +
  geom_line(aes(x = time(diff_bias_monthly),
                y = as.numeric(diff_bias_monthly)),
            color = "darkred") +
  labs(title = "Differenced Monthly Bias - USA Combined Categories",
       x = "Time",
       y = "Differenced Bias Score") +
  theme_minimal()

# Display the Differenced Plot
p4
Don't know how to automatically pick scale for object of type <ts>. Defaulting
to continuous.
# Check stationarity of differenced series
cat("ADF Test for Differenced Monthly Bias:\n")
ADF Test for Differenced Monthly Bias:
adf_diff_monthly_bias <- adf.test(diff_bias_monthly, alternative = "stationary")
Warning in adf.test(diff_bias_monthly, alternative = "stationary"): p-value
smaller than printed p-value
print(adf_diff_monthly_bias)
Augmented Dickey-Fuller Test
data: diff_bias_monthly
Dickey-Fuller = -5.6506, Lag order = 4, p-value = 0.01
alternative hypothesis: stationary
cat("Interpretation: ")
Interpretation:
if (adf_diff_monthly_bias$p.value < 0.05) {
  cat("Differenced series is STATIONARY (p-value < 0.05)\n")
} else {
  cat("Differenced series is still NON-STATIONARY (p-value >= 0.05)\n")
}
Differenced series is STATIONARY (p-value < 0.05)
# Summary of differencing results
cat("\n=== Summary of Differencing Results ===\n")
The ACF and PACF plots of the differenced daily sentiment show remaining autocorrelation, hence the need for second-order differencing.
The ACF and PACF plots of the differenced monthly sentiment and bias show that most autocorrelations fall within the significance bounds after the first few lags.
Second-order differencing for daily sentiment
# Second-order differencing for daily sentiment
diff2_sentiment_daily <- diff(diff_sentiment_daily)

# Plot the second-order differenced data
p_diff2 <- ggplot() +
  geom_line(aes(x = time(diff2_sentiment_daily),
                y = as.numeric(diff2_sentiment_daily)),
            color = "steelblue") +
  labs(title = "Second-Order Differenced Daily Sentiment - USA Combined Categories",
       x = "Time",
       y = "Second-Order Differenced Sentiment Score") +
  theme_minimal()

# Display the plot
p_diff2
Don't know how to automatically pick scale for object of type <ts>. Defaulting
to continuous.
# Check stationarity of second-order differenced series
cat("ADF Test for Second-Order Differenced Daily Sentiment:\n")
ADF Test for Second-Order Differenced Daily Sentiment:
adf_diff2_daily_sentiment <- adf.test(diff2_sentiment_daily, alternative = "stationary")
Warning in adf.test(diff2_sentiment_daily, alternative = "stationary"): p-value
smaller than printed p-value
print(adf_diff2_daily_sentiment)
Augmented Dickey-Fuller Test
data: diff2_sentiment_daily
Dickey-Fuller = -25.216, Lag order = 13, p-value = 0.01
alternative hypothesis: stationary
cat("Interpretation: ")
Interpretation:
if (adf_diff2_daily_sentiment$p.value < 0.05) {
  cat("Second-order differenced series is STATIONARY (p-value < 0.05)\n")
} else {
  cat("Second-order differenced series is still NON-STATIONARY (p-value >= 0.05)\n")
}
Second-order differenced series is STATIONARY (p-value < 0.05)
# ACF and PACF plots for second-order differenced daily sentiment
# ACF plot for second-order differenced daily sentiment
acf_diff2_daily_sentiment <- ggAcf(diff2_sentiment_daily, 50) +
  ggtitle("ACF of Second-Order Differenced Daily Sentiment") +
  theme_minimal()

# PACF plot for second-order differenced daily sentiment
pacf_diff2_daily_sentiment <- ggPacf(diff2_sentiment_daily, 50) +
  ggtitle("PACF of Second-Order Differenced Daily Sentiment") +
  theme_minimal()

# Display both plots side by side
cat("ACF and PACF Plots for Second-Order Differenced Daily Sentiment:\n")
ACF and PACF Plots for Second-Order Differenced Daily Sentiment:
# Comparison of first-order vs second-order differencing ACF plots
cat("Comparison: First-Order vs Second-Order Differencing ACF:\n")
Comparison: First-Order vs Second-Order Differencing ACF:
acf_daily_sentiment / acf_diff2_daily_sentiment
The ACF and PACF of the second-order differenced series show that the series may be over-differenced, as evidenced by negative correlation patterns.
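Over-differencing shows up as strong negative lag-1 autocorrelation: differencing a series that is already white noise produces an MA(1) process whose theoretical lag-1 autocorrelation is -0.5. A base-R demonstration on simulated data:

```r
set.seed(123)
# White noise is already stationary; differencing it is one diff too many
wn <- rnorm(2000)
over <- diff(wn)

# Sample lag-1 autocorrelation of the over-differenced series (theory: -0.5)
r1 <- acf(over, plot = FALSE)$acf[2]
print(round(r1, 2))
```

A sample lag-1 autocorrelation near -0.5 after differencing is therefore a standard diagnostic that one order of differencing too many was applied.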
For modeling, we will predict monthly sentiment and bias, hence the following parameter choices and modeling exercises will focus on monthly sentiment and bias.
Parameter choices:
- d = 1 since first-order differencing is sufficient for stationarity
- p (AR order): from the PACF plots, we see significant spikes at lags 1 and 2, so we will test p = 0, 1, 2
- q (MA order): from the ACF plot, we see a significant correlation at lag 1, so we will test q = 0, 1, 2
Parameter search for monthly sentiment
# Model 1: ARIMA (0,1,0)
model1 <- Arima(sentiment_monthly_ts, order = c(0, 1, 0), include.drift = FALSE)
summary(model1)
Series: sentiment_monthly_ts
ARIMA(0,1,0)
sigma^2 = 0.02653: log likelihood = 31.26
AIC=-60.52 AICc=-60.47 BIC=-58.15
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set -0.008521767 0.161873 0.08038175 NaN Inf 0.42168 -0.3911877
# Model 2: ARIMA (1,1,0)
model2 <- Arima(sentiment_monthly_ts, order = c(1, 1, 0), include.drift = FALSE)
summary(model2)
Series: sentiment_monthly_ts
ARIMA(1,1,0)
Coefficients:
ar1
-0.4722
s.e. 0.1117
sigma^2 = 0.02188: log likelihood = 39.25
AIC=-74.5 AICc=-74.34 BIC=-69.76
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set -0.01166505 0.1460697 0.07906747 NaN Inf 0.4147853 -0.04152011
# Model 3: ARIMA (2,1,0)
model3 <- Arima(sentiment_monthly_ts, order = c(2, 1, 0), include.drift = FALSE)
summary(model3)
Series: sentiment_monthly_ts
ARIMA(2,1,0)
Coefficients:
ar1 ar2
-0.5970 -0.3040
s.e. 0.1181 0.1181
sigma^2 = 0.02042: log likelihood = 42.41
AIC=-78.83 AICc=-78.51 BIC=-71.72
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set -0.01394514 0.1401805 0.07923248 NaN Inf 0.4156509 0.06257796
# Model 4: ARIMA (0,1,1)
model4 <- Arima(sentiment_monthly_ts, order = c(0, 1, 1), include.drift = FALSE)
summary(model4)
Series: sentiment_monthly_ts
ARIMA(0,1,1)
Coefficients:
ma1
-0.5574
s.e. 0.0939
sigma^2 = 0.02054: log likelihood = 41.69
AIC=-79.37 AICc=-79.21 BIC=-74.63
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set -0.01578316 0.1415283 0.07938302 NaN Inf 0.4164407 0.03344299
# Model 5: ARIMA (0,1,2)
model5 <- Arima(sentiment_monthly_ts, order = c(0, 1, 2), include.drift = FALSE)
summary(model5)
Series: sentiment_monthly_ts
ARIMA(0,1,2)
Coefficients:
ma1 ma2
-0.6394 0.1344
s.e. 0.1281 0.1363
sigma^2 = 0.02056: log likelihood = 42.15
AIC=-78.29 AICc=-77.97 BIC=-71.19
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set -0.01450894 0.1406629 0.08036869 NaN Inf 0.4216115 0.1006025
# Model 6: ARIMA (1,1,1)
model6 <- Arima(sentiment_monthly_ts, order = c(1, 1, 1), include.drift = FALSE)
summary(model6)
Series: sentiment_monthly_ts
ARIMA(1,1,1)
Coefficients:
ar1 ma1
-0.1549 -0.4603
s.e. 0.1931 0.1628
sigma^2 = 0.02065: log likelihood = 41.99
AIC=-77.97 AICc=-77.65 BIC=-70.87
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set -0.01512532 0.1409653 0.07995176 NaN Inf 0.4194242 0.08095608
# Model 7: ARIMA (1,1,2)
model7 <- Arima(sentiment_monthly_ts, order = c(1, 1, 2), include.drift = FALSE)
summary(model7)
Series: sentiment_monthly_ts
ARIMA(1,1,2)
Coefficients:
ar1 ma1 ma2
0.3016 -0.9289 0.3052
s.e. 0.5271 0.4912 0.2762
sigma^2 = 0.02075: log likelihood = 42.29
AIC=-76.59 AICc=-76.05 BIC=-67.11
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set -0.01372643 0.140405 0.08038007 NaN Inf 0.4216711 0.0928004
# Model 8: ARIMA (2,1,2)
model8 <- Arima(sentiment_monthly_ts, order = c(2, 1, 2), include.drift = FALSE)
summary(model8)
Series: sentiment_monthly_ts
ARIMA(2,1,2)
Coefficients:
ar1 ar2 ma1 ma2
-0.1234 -0.2373 -0.4890 0.2144
s.e. 0.8119 0.2384 0.8094 0.4391
sigma^2 = 0.02085: log likelihood = 42.62
AIC=-75.23 AICc=-74.41 BIC=-63.39
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set -0.01382598 0.1398063 0.07972581 NaN Inf 0.4182389 0.079746
Model 4: ARIMA (0,1,1) has the lowest AIC (-79.37), indicating it is the best model according to this manual selection process
Loop-based approach (note: this loop was breaking the terminal in VS Code)
# Parameter search for monthly sentiment
# Load necessary libraries
library(knitr)

# Define parameter ranges based on ACF/PACF analysis
p_range <- 0:2  # AR order: test p = 0, 1, 2 based on PACF analysis
d <- 1          # Differencing: d = 1 since first-order differencing is sufficient
q_range <- 0:2  # MA order: test q = 0, 1, 2 based on ACF analysis

# Calculate total combinations
n_combinations <- length(p_range) * length(q_range)
cat("Testing", n_combinations, "ARIMA model combinations for monthly sentiment...\n\n")
Testing 9 ARIMA model combinations for monthly sentiment...
# Create an empty matrix to store results
results_matrix <- matrix(NA, nrow = n_combinations, ncol = 6)
# Initialize index for matrix row
i <- 1
cat("Starting ARIMA parameter search...\n")
Starting ARIMA parameter search...
# Loop through combinations of ARIMA model parameters
for (p in p_range) {
  for (q in q_range) {
    cat("Testing ARIMA(", p, ",", d, ",", q, ")... ")
    # Try to fit the ARIMA model with comprehensive error handling
    tryCatch({
      # Check if the data has enough observations
      if (length(sentiment_monthly_ts) < (p + q + d + 1)) {
        cat("INSUFFICIENT DATA\n")
        results_matrix[i, ] <- c(p, d, q, NA, NA, NA)
      } else {
        # Fit ARIMA model with specified (p,d,q) for monthly sentiment
        model <- Arima(sentiment_monthly_ts, order = c(p, d, q), include.drift = FALSE)
        # Store model parameters and AIC/BIC/AICc values in matrix
        results_matrix[i, ] <- c(p, d, q, model$aic, model$bic, model$aicc)
        # Print success
        cat("SUCCESS - AIC:", round(model$aic, 3), "\n")
      }
    }, warning = function(w) {
      cat("WARNING:", w$message, "\n")
      # Handlers run in their own environment, so <<- is needed to update the matrix
      results_matrix[i, ] <<- c(p, d, q, NA, NA, NA)
    }, error = function(e) {
      # If model fails to fit, store NA values and print error
      cat("FAILED -", e$message, "\n")
      results_matrix[i, ] <<- c(p, d, q, NA, NA, NA)
    })
    # Increment row index
    i <- i + 1
    # Add a small delay to prevent overwhelming the system
    Sys.sleep(0.1)
  }
}
# Convert matrix to data frame
results_df <- as.data.frame(results_matrix)
colnames(results_df) <- c("p", "d", "q", "AIC", "BIC", "AICc")

# Remove rows with NA values
results_df_clean <- results_df[complete.cases(results_df), ]
cat("Successfully fitted models:", nrow(results_df_clean), "out of", nrow(results_df), "\n")
Successfully fitted models: 9 out of 9
if (nrow(results_df_clean) > 0) {
  # Find the row with the lowest AIC, BIC, and AICc
  best_aic_row <- which.min(results_df_clean$AIC)
  best_bic_row <- which.min(results_df_clean$BIC)
  best_aicc_row <- which.min(results_df_clean$AICc)

  # Display summary of best models
  cat("\n=== Best Model Selection Results ===\n")
  cat("Best model by AIC: ARIMA(", results_df_clean[best_aic_row, "p"], ",",
      results_df_clean[best_aic_row, "d"], ",", results_df_clean[best_aic_row, "q"],
      ") - AIC =", round(results_df_clean[best_aic_row, "AIC"], 3), "\n")
  cat("Best model by BIC: ARIMA(", results_df_clean[best_bic_row, "p"], ",",
      results_df_clean[best_bic_row, "d"], ",", results_df_clean[best_bic_row, "q"],
      ") - BIC =", round(results_df_clean[best_bic_row, "BIC"], 3), "\n")
  cat("Best model by AICc: ARIMA(", results_df_clean[best_aicc_row, "p"], ",",
      results_df_clean[best_aicc_row, "d"], ",", results_df_clean[best_aicc_row, "q"],
      ") - AICc =", round(results_df_clean[best_aicc_row, "AICc"], 3), "\n")

  # Display the results table with formatting
  cat("\n")
  knitr::kable(results_df_clean, align = 'c',
               caption = "ARIMA Model Comparison for Monthly Sentiment",
               digits = 3)
} else {
  cat("ERROR: No models were successfully fitted. Check your data.\n")
  print(summary(sentiment_monthly_ts))
}
=== Best Model Selection Results ===
Best model by AIC: ARIMA( 0 , 1 , 1 ) - AIC = -79.372
Best model by BIC: ARIMA( 0 , 1 , 1 ) - BIC = -74.633
Best model by AICc: ARIMA( 0 , 1 , 1 ) - AICc = -79.214
Series: sentiment_monthly_ts
ARIMA(1,0,1) with zero mean
Coefficients:
ar1 ma1
0.9424 -0.5327
s.e. 0.0421 0.1056
sigma^2 = 0.02028: log likelihood = 42.86
AIC=-79.72 AICc=-79.41 BIC=-72.58
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 0.004185935 0.1406097 0.07172981 NaN Inf 0.3762922 -0.09366502
The residual plot shows consistent fluctuation around zero but no constant variance, which suggests the model might need slight improvement
The ACF plot does not reveal any significant correlations
The Q-Q plot indicates that the residuals follow a near-normal distribution, with minor deviations at the tails, which is typical in time series data
The Ljung-Box Test p-values remain mostly above the 0.05 significance level, implying no significant autocorrelations left in the residuals and concluding that the model is adequate
The residual plot shows neither consistent fluctuation around zero nor constant variance, which suggests the model might need improvement
The ACF plot does not reveal any significant correlations
The Q-Q plot indicates that the residuals follow a near-normal distribution, with major deviations at the tails
The Ljung-Box Test p-values remain mostly above the 0.05 significance level, implying no significant autocorrelations left in the residuals and concluding that the model might be adequate
Both models perform similarly, with the residual, ACF, and normal Q-Q plots almost identical. The only notable difference is the Ljung-Box statistic plot, where the automatic model produces p-values that are well above 0.05
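The Ljung-Box diagnostic used above can be reproduced with base R's `Box.test()` on a model's residuals. A sketch on simulated data (illustrative only; the diagnostics above come from the fitted sentiment models):

```r
set.seed(1)
# Simulate an ARIMA(0,1,1)-type series: integrate a simulated MA(1) process
y <- cumsum(arima.sim(model = list(ma = -0.5), n = 200))

# Refit the same order with base stats::arima
fit <- arima(y, order = c(0, 1, 1))

# Ljung-Box test on the residuals; fitdf accounts for the one estimated MA term
lb <- Box.test(residuals(fit), lag = 10, type = "Ljung-Box", fitdf = 1)
print(lb)
# A p-value above 0.05 would indicate no remaining residual autocorrelation,
# i.e. the model has captured the series' dynamics adequately.
```

This is the same logic behind the per-lag p-value panel in the residual diagnostic plots: each point is a Ljung-Box p-value at a given maximum lag.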
Benchmark methods and models for sentiment
# Define model orders for comparison
manual_order <- c(0, 1, 0)
auto_order <- c(1, 0, 1)
# Split data for evaluation (use last 6 observations for testing)
n <- length(sentiment_monthly_ts)
train_data <- sentiment_monthly_ts[1:(n - 6)]
test_data <- sentiment_monthly_ts[(n - 5):n]

# Fit manual model on training data
manual_arima <- Arima(train_data, order = manual_order, include.drift = FALSE)
manual_forecast <- forecast(manual_arima, h = 6)

# Fit automatic model on training data
auto_arima <- Arima(train_data, order = auto_order, include.drift = FALSE)
auto_forecast <- forecast(auto_arima, h = 6)
# Benchmark methods
# Naive method
naive_forecast <- naive(train_data, h = 6)
# Mean method
mean_forecast <- meanf(train_data, h = 6)
# Random walk with drift
rwf_forecast <- rwf(train_data, h = 6, drift = TRUE)

# Calculate accuracy measures
manual_accuracy <- accuracy(manual_forecast, test_data)
auto_accuracy <- accuracy(auto_forecast, test_data)
naive_accuracy <- accuracy(naive_forecast, test_data)
mean_accuracy <- accuracy(mean_forecast, test_data)
rwf_accuracy <- accuracy(rwf_forecast, test_data)

# Create comparison table
comparison <- data.frame(
  Method = c("Manual ARIMA", "Auto ARIMA", "Naive", "Mean", "RWF with Drift"),
  MAE = c(manual_accuracy[2, 3], auto_accuracy[2, 3], naive_accuracy[2, 3],
          mean_accuracy[2, 3], rwf_accuracy[2, 3]),
  RMSE = c(manual_accuracy[2, 2], auto_accuracy[2, 2], naive_accuracy[2, 2],
           mean_accuracy[2, 2], rwf_accuracy[2, 2]),
  MAPE = c(manual_accuracy[2, 5], auto_accuracy[2, 5], naive_accuracy[2, 5],
           mean_accuracy[2, 5], rwf_accuracy[2, 5])
)
Model Performance Comparison

Method           MAE     RMSE    MAPE
Manual ARIMA     0.025   0.027    52.577
Auto ARIMA       0.075   0.080   167.259
Naive            0.025   0.027    52.577
Mean             0.155   0.156   403.982
RWF with Drift   0.009   0.014    28.218
The random walk with drift has the lowest MAE, RMSE, and MAPE, outperforming both the manual ARIMA and the auto ARIMA
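The benchmark forecasts have simple closed forms: the naive method repeats the last observation, the mean method repeats the training mean, and the drift method extrapolates the line through the first and last training points. A base-R sketch of these rules and of the MAE/RMSE/MAPE metrics used in the table (toy data, not the sentiment series):

```r
# Toy train/test split
train <- c(0.10, 0.14, 0.12, 0.18, 0.20, 0.22)
test  <- c(0.24, 0.25)
h <- length(test)

# Benchmark forecast rules
naive_fc <- rep(tail(train, 1), h)                 # last value carried forward
mean_fc  <- rep(mean(train), h)                    # training mean repeated
slope    <- (tail(train, 1) - train[1]) / (length(train) - 1)
drift_fc <- tail(train, 1) + slope * seq_len(h)    # random walk with drift

# Accuracy measures as reported in the comparison table
mae  <- function(a, f) mean(abs(a - f))
rmse <- function(a, f) sqrt(mean((a - f)^2))
mape <- function(a, f) mean(abs((a - f) / a)) * 100

print(round(c(naive = mae(test, naive_fc),
              mean  = mae(test, mean_fc),
              drift = mae(test, drift_fc)), 4))
```

Note also that MAPE divides by the actual values, so the large MAPE figures in the table partly reflect sentiment scores that hover near zero.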
We plot the forecasts with the manual ARIMA as the chosen model alongside the baseline models, since it outperforms the auto ARIMA
# Forecast visualization
# Fit the best manual model on full data
best_manual_fit <- Arima(sentiment_monthly_ts, order = manual_order, include.drift = FALSE)

# Create forecast plot comparing all methods
autoplot(sentiment_monthly_ts) +
  autolayer(meanf(sentiment_monthly_ts, h = 60), series = "Mean", PI = FALSE) +
  autolayer(naive(sentiment_monthly_ts, h = 60), series = "Naïve", PI = FALSE) +
  autolayer(rwf(sentiment_monthly_ts, drift = TRUE, h = 60), series = "RWF with Drift", PI = FALSE) +
  autolayer(forecast(best_manual_fit, h = 60), series = "Manual ARIMA", PI = FALSE) +
  ggtitle("Sentiment Forecast Comparison") +
  xlab("Time") +
  ylab("Sentiment (health, science, and general category)") +
  guides(colour = guide_legend(title = "Forecast Methods")) +
  theme_minimal()
Parameter search for monthly bias
# Model 1: ARIMA (0,1,0)
model1 <- Arima(bias_monthly_ts, order = c(0, 1, 0), include.drift = FALSE)
summary(model1)
Series: bias_monthly_ts
ARIMA(0,1,0)
sigma^2 = 0.1704: log likelihood = -42.19
AIC=86.39 AICc=86.44 BIC=88.76
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.01140549 0.4101878 0.2138057 -188.5115 482.6352 0.6365364
ACF1
Training set -0.5095393
# Model 2: ARIMA (1,1,0)
model2 <- Arima(bias_monthly_ts, order = c(1, 1, 0), include.drift = FALSE)
summary(model2)
Series: bias_monthly_ts
ARIMA(1,1,0)
Coefficients:
ar1
-0.5997
s.e. 0.1007
sigma^2 = 0.1191: log likelihood = -27.78
AIC=59.55 AICc=59.71 BIC=64.29
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.01460045 0.3408022 0.1978034 -78.00665 276.4834 0.5888948
ACF1
Training set -0.1270015
# Model 3: ARIMA (2,1,0)
model3 <- Arima(bias_monthly_ts, order = c(2, 1, 0), include.drift = FALSE)
summary(model3)
Series: bias_monthly_ts
ARIMA(2,1,0)
Coefficients:
ar1 ar2
-0.7991 -0.4049
s.e. 0.1088 0.1120
sigma^2 = 0.1033: log likelihood = -21.79
AIC=49.58 AICc=49.9 BIC=56.69
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.01659485 0.3153254 0.1989953 131.0169 297.7831 0.5924432
ACF1
Training set 0.02386103
# Model 4: ARIMA (0,1,1)
model4 <- Arima(bias_monthly_ts, order = c(0, 1, 1), include.drift = FALSE)
summary(model4)
Series: bias_monthly_ts
ARIMA(0,1,1)
Coefficients:
ma1
-0.7562
s.e. 0.0827
sigma^2 = 0.1034: log likelihood = -22.4
AIC=48.8 AICc=48.96 BIC=53.54
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.02437428 0.3175789 0.1949152 81.01677 183.9622 0.5802961
ACF1
Training set -0.03081449
# Model 5: ARIMA (0,1,2)
model5 <- Arima(bias_monthly_ts, order = c(0, 1, 2), include.drift = FALSE)
summary(model5)
Series: bias_monthly_ts
ARIMA(0,1,2)
Coefficients:
ma1 ma2
-0.8414 0.1776
s.e. 0.1056 0.1283
sigma^2 = 0.1025: log likelihood = -21.5
AIC=49 AICc=49.32 BIC=56.11
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.01927906 0.3141134 0.1921457 103.312 227.0911 0.5720508
ACF1
Training set 0.05863614
# Model 6: ARIMA (1,1,1)
model6 <- Arima(bias_monthly_ts, order = c(1, 1, 1), include.drift = FALSE)
summary(model6)
Series: bias_monthly_ts
ARIMA(1,1,1)
Coefficients:
ar1 ma1
-0.2406 -0.6166
s.e. 0.1679 0.1422
sigma^2 = 0.1024: log likelihood = -21.47
AIC=48.93 AICc=49.25 BIC=56.04
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.01998001 0.3139071 0.1927771 103.1747 226.0265 0.5739305
ACF1
Training set 0.06668672
# Model 7: ARIMA (1,1,2)
model7 <- Arima(bias_monthly_ts, order = c(1, 1, 2), include.drift = FALSE)
summary(model7)
Series: bias_monthly_ts
ARIMA(1,1,2)
Coefficients:
ar1 ma1 ma2
-0.9871 0.2063 -0.6736
s.e. 0.0224 0.0973 0.0949
sigma^2 = 0.09993: log likelihood = -20.31
AIC=48.62 AICc=49.16 BIC=58.1
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.02120117 0.308111 0.2029017 9.052154 214.1556 0.6040732
ACF1
Training set 0.03457281
# Model 8: ARIMA (2,1,2)
model8 <- Arima(bias_monthly_ts, order = c(2, 1, 2), include.drift = FALSE)
summary(model8)
Series: bias_monthly_ts
ARIMA(2,1,2)
Coefficients:
ar1 ar2 ma1 ma2
-0.5656 -0.2268 -0.2869 -0.0592
s.e. 0.4246 0.2291 0.4205 0.2984
sigma^2 = 0.1044: log likelihood = -21.18
AIC=52.36 AICc=53.18 BIC=64.21
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.01835702 0.3128212 0.1953215 127.972 275.8102 0.5815057
ACF1
Training set 0.07124832
Model 7: ARIMA (1,1,2) has the lowest AIC (48.62), indicating it is the best model for bias according to this manual selection process
Series: bias_monthly_ts
ARIMA(4,0,1) with zero mean
Coefficients:
ar1 ar2 ar3 ar4 ma1
-0.7682 0.3098 0.4889 0.3909 0.8927
s.e. 0.1176 0.1252 0.1266 0.1098 0.0672
sigma^2 = 0.08966: log likelihood = -15.27
AIC=42.54 AICc=43.7 BIC=56.84
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set 0.03253763 0.2899303 0.1999279 55.82998 278.6028 0.5952198
ACF1
Training set 0.031864
The residual plot does not fluctuate consistently around zero and the variance is not constant, which suggests the model might need improvement.
The ACF plot does not reveal any significant autocorrelations.
The Q-Q plot indicates that the residuals are only roughly normal, with major deviations at the tails.
The Ljung-Box test p-values remain mostly above the 0.05 significance level, implying no significant autocorrelation is left in the residuals and suggesting the model is adequate.
The residual plot does not fluctuate consistently around zero and the variance is not constant, which suggests the model might need improvement.
The ACF plot does not reveal any significant autocorrelations.
The Q-Q plot indicates that the residuals follow a near-normal distribution, with minor deviations at the tails.
The Ljung-Box test p-values remain mostly above the 0.05 significance level, implying no significant autocorrelation is left in the residuals and suggesting the model might be adequate.
Both models perform similarly: the residual, ACF, and normal Q-Q plots are nearly identical. The main difference is in the Ljung-Box statistic plot, where the automatic model produces p-values that are well above 0.05.
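The four diagnostics described above (residuals over time, ACF, normality, and the Ljung-Box test) can be reproduced with one call per model; a sketch, assuming `model7` holds the manual ARIMA(1,1,2) fit from above:

```r
# checkresiduals() from the forecast package plots the residuals,
# their ACF, and a histogram, and prints the Ljung-Box test
checkresiduals(model7)

# The Q-Q plot is not included, so draw it separately
qqnorm(residuals(model7))
qqline(residuals(model7))
```

The same two calls applied to the automatic ARIMA(4,0,1) fit give the side-by-side comparison discussed here.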
Benchmark methods and models for bias
# Define model orders for comparison
manual_order <- c(1, 1, 2)
auto_order <- c(4, 0, 1)
# Split data for evaluation (use last 6 observations for testing)
n <- length(bias_monthly_ts)
train_data <- bias_monthly_ts[1:(n - 6)]
test_data <- bias_monthly_ts[(n - 5):n]

# Fit manual model on training data
manual_arima <- Arima(train_data, order = manual_order, include.drift = FALSE)
manual_forecast <- forecast(manual_arima, h = 6)

# Fit automatic model on training data
auto_arima <- Arima(train_data, order = auto_order, include.drift = FALSE)
auto_forecast <- forecast(auto_arima, h = 6)
# Benchmark methods
# Naive method
naive_forecast <- naive(train_data, h = 6)
# Mean method
mean_forecast <- meanf(train_data, h = 6)
# Random walk with drift
rwf_forecast <- rwf(train_data, h = 6, drift = TRUE)

# Calculate accuracy measures
manual_accuracy <- accuracy(manual_forecast, test_data)
auto_accuracy <- accuracy(auto_forecast, test_data)
naive_accuracy <- accuracy(naive_forecast, test_data)
mean_accuracy <- accuracy(mean_forecast, test_data)
rwf_accuracy <- accuracy(rwf_forecast, test_data)

# Create comparison table
comparison <- data.frame(
  Method = c("Manual ARIMA", "Auto ARIMA", "Naive", "Mean", "RWF with Drift"),
  MAE = c(manual_accuracy[2, 3], auto_accuracy[2, 3], naive_accuracy[2, 3],
          mean_accuracy[2, 3], rwf_accuracy[2, 3]),
  RMSE = c(manual_accuracy[2, 2], auto_accuracy[2, 2], naive_accuracy[2, 2],
           mean_accuracy[2, 2], rwf_accuracy[2, 2]),
  MAPE = c(manual_accuracy[2, 5], auto_accuracy[2, 5], naive_accuracy[2, 5],
           mean_accuracy[2, 5], rwf_accuracy[2, 5])
)
Model Performance Comparison

Method           MAE     RMSE    MAPE
Manual ARIMA     0.024   0.029    40.491
Auto ARIMA       0.032   0.037    73.795
Naive            0.012   0.016    29.161
Mean             0.063   0.065   137.971
RWF with Drift   0.040   0.047    71.534
The Naive method has the lowest MAE, RMSE, and MAPE, outperforming both the Manual ARIMA and Auto ARIMA models on the test set.
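With only 6 test observations, the gap between the naive method and the manual ARIMA may not be statistically meaningful; a sketch of a Diebold-Mariano test on the two sets of forecast errors (`dm.test()` from the forecast package, assuming the `test_data`, `manual_forecast`, and `naive_forecast` objects defined above), noting that the test has little power at this sample size:

```r
# Compare out-of-sample forecast errors of the two methods
e_manual <- test_data - manual_forecast$mean
e_naive  <- test_data - naive_forecast$mean
dm.test(e_manual, e_naive, h = 6)
```

A large p-value here would mean the naive method's advantage could be due to chance rather than a genuinely better forecast.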
Plot the forecasts, using the manual ARIMA as the chosen model to display alongside the baseline methods, since it outperforms the auto ARIMA.
# Forecast visualization
# Fit the best manual model on full data
best_manual_fit <- Arima(bias_monthly_ts, order = manual_order, include.drift = FALSE)

# Create forecast plot comparing all methods
autoplot(bias_monthly_ts) +
  autolayer(meanf(bias_monthly_ts, h = 60), series = "Mean", PI = FALSE) +
  autolayer(naive(bias_monthly_ts, h = 60), series = "Naïve", PI = FALSE) +
  autolayer(rwf(bias_monthly_ts, drift = TRUE, h = 60), series = "RWF with Drift", PI = FALSE) +
  autolayer(forecast(best_manual_fit, h = 60), series = "Manual ARIMA", PI = FALSE) +
  ggtitle("Bias Forecast Comparison") +
  xlab("Time") +
  ylab("Bias (health, science, and general category)") +
  guides(colour = guide_legend(title = "Forecast Methods")) +
  theme_minimal()